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We present an overview of our project of simulation of unquenched lattice QCD with optimal 
domain-wall quarks, using a GPU cluster currently constituting of 16 units of Nvidia Tesla S 1070 
plus 64 graphic cards with Nvidia GTX285 (total 128 GPUs with 128 Terafiops peak), attaining 
sustained computing power of 15.36 Terafiops. The first production run in two-flavor QCD is on- 
going, using the Iwasaki gauge action on a set of lattices with sizes 16 3 x (32,10,8, 6,4)x (16,32) 
at the lattice spacing a ~ 0. 1 fm, with eight sea quark masses down to m n ~ 200 MeV. We outline 
our simulation algorithm, and describe the present status of the production run. Preliminary 
results of pseudoscalar mass and decay constant are also presented. 
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1. Introduction 

Simulation of unquenched lattice QCD with exact chiral symmetry is a grand challenge among 
all sciences. Even for a modest 16 3 x 32 lattice with lattice spacing a ~ 0.1 fm, it often requires 
a supercomputer with peak computing power more than 50 Teraflops (e.g., 10 racks of IBM Blue- 
Gene/L). Therefore, only 2-3 lattice QCD groups around the world could afford to perform the 
simulation of unquenched lattice QCD with the domain- wall fermion [|IJ], or the overlap-Dirac 
fermion [Eh. (For a recent review of generating gauge ensembles of QCD, see Ref. [|f].) 

However, this scenario has been undergoing a dramatic change during the last 9 months. With 
the emergence of low-cost and computationally powerful Graphic Processing Unit (GPU), now 
plugging a graphic card with Nvidia GTX285 (240 cores, one Teraflops peak) into a PC immedi- 
ately turns the system into a powerful device, attaining a sustained 140 Gflops for the conjugate 
gradient (CG) with mixed precision. This implies that one can perform the Hybrid Monte Carlo 
(HMC) simulation of unquenched lattice QCD with exact chiral symmetry, using a single PC with 
a GTX285 graphic card, rather than a cluster of 128 nodes. Moreover, the cost of such a system (PC 
+ GPU card) could be less than $600. In terms of price/performance, a GPU cluster is only 1/100 
of a supercomputer (e.g., IBM BlueGene/L). Also, the power consumption of a GPU cluster is less 
than 1/10 of any supercomputer, for the same amount of (sustained) Teraflops. Evidently, these 
salient features of GPU have revolutionary impacts to lattice QCD as well as all computational 
sciences. For a recent review of QCD on GPUs, see Ref. 

The dynamical DWF project of TWQCD Collaboration was started around November 2007, 
soon after CUDA version 1.1 became available 1 . This was the right time for us to cut in since we 
could write our QCD code in CUDA rather than the much more tedious Cg [g]. We completed our 
first CUDA code around March 2008, which implemented the nested conjugate gradient with mixed 
precision to compute the overlap fermion propagator, in which the inner CG (i.e., the sign function 
times a vector) was computed with the two pass algorithm Note that Nvidia GTX280 had 

not been available at that time, thus all double precision operations were performed by the CPU, 
yet we still achieved 20 Gflops (sustained) using the Nvidia GTX8800 + Intel Q6600 (2.40 GHz). 

At lattice 2008, two groups reported using GPU for the Wilson-Dirac matrix multiplication 
with a vector [§, §], attaining ~ 100 Gflops with Nvidia GTX280. On the other hand, our aim had 
been to simulate unquenched lattice QCD with exact chiral symmetry, in which the major difficulty 
is how to preserve chiral symmetry to a high precision and also to sample all topological sectors 
ergodically. For lattice QCD with the conventional domain- wall fermion [|ll|], it suffers from a 
large chiral symmetry breaking (i.e., large residual mass) for N s = 16, which becomes problem- 
atic for finite temperature QCD. On the other hand, for lattice QCD with overlap-Dirac fermion, 
the action is discontinuous at the boundary between different topological sectors, and it is very 



costly to tunnel through the topological boundary using the refraction/reflection algorithm [10]. A 
plausible way to proceed is to fix the topology JT2|], then use the measured values of topological 
susceptibility to remove the artifacts in any physical observables due to the fixed topology in a finite 
volume. However, it has not been proved that suppressing the near zero modes (with the unphysical 
Wilson fermions and the associated twisted-mass ghosts) does not violate the ergodicity in a fixed 
topological sector. 



T.W.C. thanks Kelvin Chiu (then at Intel Corporation, CA) for his helpful suggestion. 
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Now we turn to the optimal domain-wall fermion. Unlike the conventional domain-wall 
fermion suffering from large chiral symmetry breaking, the optimal domain-wall fermion attains 
the (mathematically) maximal chiral symmetry by judiciously fixing the weight (O s for each layer 
in the fifth dimension, according to the formula derived in Ref. [13]. Then the effctive 4D Dirac 
operator is exactly equal to the overlap-Dirac operator with the sign function approximated by the 
Zolotarev optimal rational polynomial. 

The first step of our project was to develop the CUDA code for the CG of the 5D optimal 
domain- wall quark matrix, the most computationally intensive part in HMC. After tuning our 
CUDA code for 5 months, in January 2009, we achieved 120 Gflops for a single GTX280. 

The outline of this paper is as follows. In Section 2, we describe the hardware of our project. 
In Section 3, we outline our HMC simulation with the optimal domain-wall fermion. In Section 4, 
we briefly describe the GPU optimization of our CG code. In Section 5, we describe the present 
status of our production runs. In Section 6, we present preliminary results of pseudoscalar mass 
and decay constant. Finally, in Section 7, we conclude with some remarks. 



2. The hardware 

Our hardware is a GPU cluster currently constituting of 16 units of Nvidia Tesla S1070 plus 
64 Nvidia GTX285 graphic cards. Each unit of Nvidia Tesla S1070 contains 4 Tesla GPUs, 
each having 4 Gbyte RAM, while each GTX285 has 1 or 2 Gbyte RAM. Each unit of Tesla 
S1070 is connected to the PCI Express xl6 slots of a Tyan server through two interconnect ca- 
bles and two interface cards. Each Tyan Server has two Intel Quad-Core Xeon processors and 
32 GB RAM. For the GTX285 cards, every two GTX285 cards are plugged into the PCI Ex- 
press xl6 slots of a 2U server with one Intel i7 CPU and 12 Gbyte RAM. The hard disk stor- 
age of our system is over 300 Terabytes, which is managed through the Lustre cluster file sys- 
tem (http://www.sun.com/software/products/lustre/index.xml). The hardware cost of our system is 
around $220,000. The sustained computing power of our system attains 15.36 Teraflops for HMC 
simulation of lattice QCD with optimal domain-wall quarks. We use the Condor to manage the 
workload of our computing jobs (see http://www.cs.wisc.edu/condor/). 



3. Hybrid Monte Carlo simulation with optimal domain-wall quark 



The action of optimal domain-wall fermion [ |13[ ] is defined as 

N s 

S dwf = £ E^( ffl ^ + 1 )»' 5 M' + ( ffl ^ _1 )^( P +^V- 1+P - 5 ^+0}^' O- 1 ) 

s,s'=lx,x' 

= *D odw f¥ 

with boundary conditions P+Yx,0 = ~( m q/^ m o)P+Yx,N^ P-Wx,n s+ i = — ( m q/2m )P_\i/ x i, where 
P± = (1 ± 7s)/2. Here D w is the standard Wilson Dirac operator plus a negative parameter —iuq 
(0 < mo < 2), m q is the bare quark mass, and the weights {co s } are specified by the exact formula 
derived in Ref. [13]. The optimal domain- wall fermion operator D oc i w f can be transformed to 

1 1-L 1 



D opt =D w +M, M — 



L = P + L + + P L , 



(3.2) 
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where the CO denotes the diagonal matrix of the weights {co s ,s = 1, • • ,N S }, and 

, , f<V s 1, l<s<N s , , N + 

(L + ) s/ = \ 7 " ", L_ = (L + )t. (3.3) 

Obviously, M does not carry the color and 4D space-time indices, and it can be solved analytically. 
Since det(D op ,) only differs from det(D (X / w y-) by a factor independent of the gauge field, we can 
use D opt to perform the HMC simulation rather than D d w f. 

With even-odd preconditioning in the 4D space-time lattice, D opt can be written as 




D "P' [ n oe v I I D°JX- X 1 J I X-D°JX- l D e ° ' ' " ' " ( "' 4> 




where X = 4 - m +M. Since detD op , = det(l - D° e X- l D e H °X- 1 ) detX, and X does not depend on 
the gauge field, we can use the operator 

C=l-D°JX- 1 D e :!X- 1 (3.5) 

for the HMC simulation. After including the Pauli-Villars fields (with m q = 2mo), the pseudo- 
fermion action for 2-flavor QCD (m u = mj) can be written as 

S pf = ^cl v {CC)- l C PV <t>, (3.6) 

where Cpy = C(m q = 2mo). 

In the HMC simulation, we first generate random noise vector E, with Gaussian distribution, 
then we obtain = CpyCt, = Dt, with CG. With fixed 0, the system is evolved with a ficti- 
tuous Hamiltonian dynamics, the so-called molecular dynamics (MD). In the MD, we use the 
Omelyan integrator [14], and the Sexton-Weingarten multiple-time scale method [|l^]. The most 
time-consuming part in the MD is to compute 77 = (DD^) -1 ^ with CG, which is required for the 
evaluation of the fermion force in the equation of motion of the conjugate momentum of the gauge 
field. Thus we program the GPU (Nvidia T10 or GTX285) to compute T\ = (DD^)~ l ^>, using CG 
with mixed precision. Also, we have ported the computation of the gauge force and the gauge field 
update to the GPU. Our next step is to implement the computation of the fermion force entirely 
inside GPU. In some of our HMC simulations, we also introduce an extra heavy fermion field to 



reduce the condition number of the CG involving the light quark [16]. In this case, we can have 
both CPU and GPU compute concurrently, e.g., while the GPU is working on the CG of the light 
quark field, the CPU can compute the fermion force of the heavy fermion field. This asynchronous 
concurrent excecution mode enhances the overall performance by ~ 5%. 



4. GPU optimizations 

The Nvidia CUDA programming guide (see http://developer.download.nvidia.com) has pro- 
vided general guidelines to optimize an application code for CUDA. Here we outline some useful 
techniques in tuning our CUDA code for CG (with mixed precision) of the (optimal) domain-wall 
quark matrix. First, we have one thread take care of all computations at each (s,x,y,z), with a 
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loop going over all t (even/odd). The threads are divided into a one-dimensional grid with two- 
dimensional blocks. Each block has 64 threads, with N s threads in the X-direction and 64 /N s 
threads in the Y-direction. The number of blocks is equal to N s N x N y N z /64. At first, a naive tran- 
scription of our C code to CUDA only yielded 15 Gflops on a single GTX280. After putting the 
constant matrix X -1 (only involving the 5-th dimensional and Dirac indices) in the constant mem- 
ory space, and using the texture memory for link variables and vectors, we obtained an enhance- 
ment from 15 to 30 Gflops. Then we unrolled the short loops (involving Dirac and color indices) in 
our CG code. This gave a further enhancement from 30 to 50 Gflops. Next we reordered the data 
in the global memory such that the memory access by all threads of a half-wrap is coalesced into 
one memory transaction. This enhanced the performance from 50 to 80 Gflops. Then we reduced 
the number of local variables in our code, which gave another jump from 80 to 100 Gflops. Finally, 
we combined the threads by re-using forward/backward data (in t) for neighboring sites. This in- 
creased the performance from 100 to 120 Gflops. For Nvidia T10/GTX285, our CG code attains 
100/140 Gflops, where the difference is mainly due to the memory bandwidth and clock frequency 
of the GPU. For the T10 processor with 4 GB device memory, it can only accommodate a lattice 
up to the size 24 3 x 48 x 16. In order to simulate a larger lattice, we are porting our CG CUDA 
code to multi-GPUs under MPI and OpenMP. 

5. Current status of production runs 

The first production run in two-flavor QCD is on-going, using the Iwasaki gauge action on 
a set of lattices with sizes 16 3 x (32,10,8,6,4) x (16,32) at the lattice spacing a ~ 0.1 fm (/3 = 
2.20), with eight sea quark masses ranging from m q a = 0.01,0.02, • • • ,0.08. After discarding 300 
trajectories for thermalization, we are accumulating 5120 trajectories in total for each sea quark 
mass. Using the binning method, we estimate the autocorrelation time to be around 20, based 
on the saturation of the jackknife error of the plaquette as a function of the bin size, as shown 
in Fig. [|(a). Thus we sample one configuration every 20 accepted trajectories, and we have 256 
configurations for each m q . The topological charge distribution of 66 configurations for m q = 0.02 
is plotted in Fig. [j](b). Currently, we are completing the statistics for each gauge ensemble, and we 
expect to have the first physics results from the ensemble with the lattice size 16 3 x 32 x 16. 

6. Preliminary results of pseudoscalar mass and decay constant 

In this section, we present some preliminary results of the pseudoscalar mass and decay con- 
stant, for the gauge configurations generated with 2-flavor QCD with optimal domain-wall quarks 
and the Iwasaki gluon action at j8 = 2.20, on the 16 3 x 32 x 16 lattice. The weights {co s } are fixed 
such that the error of the sign function S opt (H w ) is less than 10~ 5 . In Fig. ||, we plot m\/m q and f K 
versus m q respectively. The statistics for each quark mass is about 100-150 configurations. We fit 
our results to the next-to-leading order (NLO) formulas in chiral perturbation theory (ChPT) JT7| ] 




x = 




4B 



(6.1) 



f% = /(I -x\nx) + c 4 x. 



(6.2) 
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Figure 1: (a) Estimation of the autocorrelation time using the binning method. The jackknife error of the 
plaquette is measured from 4000 accepted trajectories in the HMC simulation of 2-flavor QCD with optimal 
domain-wall quarks (m q a = 0.02) on the 16 3 x 32 x 16 lattice, (b) The topological charge distribution of 66 
configurations which are sampled every 20 accepted trajectories. 




Using the fitted value of fa and the physical input f % = 131 MeV, we obtain cT x = 1.5(1) GeV, a 
rough estimate of the lattice spacing. From the fitted value of B, we obtain the chiral condensate 
£ = Bf 2 /2 = [261(20)MeV] 3 . The complete analysis with the statistics of 250 configurations for 
each quark mass will be presented in a forthcoming paper. 

7. Concluding remark 

This is the first project to use a GPU cluster to perform a large-scale simulation of lattice QCD 
with exact chiral symmetry. Our GPU cluster is less than 1/100 in price/performance ratio and only 
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1/10 of the power consumption of a conventional supercomputer. Using the optimal domain-wall 
fermion, we can simulate unquenched lattice QCD with exact chiral symmetry, without fixing the 
topology. The cost of preserving chiral symmetry to a high precision is counterbalanced by the 
low price/performance ratio of a GPU. The production runs for 2-flavor QCD on the 16 3 x 32 x 16 
lattice will be completed by the end of 2009. For the (2+l)-flavor QCD on the 16 3 x 32 x 16 
lattice, we simulate the strange quark with our newly constructed positive-definite pseudofermion 



action for a single fermion flavor []18j]. The thermalization of this set of gauge ensembles is now in 
progress. 
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